@fangchenli
Member

@fangchenli added the Performance (Memory or execution speed performance) and Arrow (pyarrow functionality) labels on Jan 25, 2026
@aaron-seq left a comment

Rechunking after concat for arrow-backed arrays is the right approach and addresses the performance problem reported in issue #42357. The threshold-based logic and the configuration option are well designed.

However, 13 tests fail across multiple CI jobs, all in checks of DataFrame attributes. The failures report "AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="data") are different" and "Attribute "dtype" are different". This suggests that the rechunking step changes the dtype of the result, or some other DataFrame attribute, in a way the tests do not expect.

The likely cause is that rechunking alters dtype metadata or chunked-array properties that the tests compare. Please check whether the .cast(pa_dtype) call on line 1845 preserves every dtype attribute of the original result, particularly for complex dtypes.

Recommendation: Debug the failing tests to find out exactly which attributes differ after rechunking, and make sure the rechunked array has the same dtype as the original concatenated result.

Development

Successfully merging this pull request may close these issues.

PERF: concat of pyarrow string array does not rechunk

2 participants